:orphan:

Core Basics 2: Train a Classifier on a Star Multi-Table Dataset
===============================================================

In this notebook we learn how to train a classifier with multi-table
data composed of two tables (a root table and a secondary table). It is
highly recommended to go through the *Core Basics 1* lesson first if you
are not familiar with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops, checking its installation and defining
some helper functions:

.. code:: ipython3

   import os
   import platform
   import subprocess
   from khiops import core as kh

   # Define peek helper function
   def peek(file_path, n=10):
       """Shows the first n lines of a file"""
       with open(file_path, encoding="utf8", errors="replace") as file:
           for line in file.readlines()[:n]:
               print(line, end="")
       print("")

   # If there are any issues you may check the Khiops status with the following command
   # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’ll train a “sarcasm detector” using the dataset ``HeadlineSarcasm``.
In its raw form, it contains a list of text headlines paired with a
label that indicates whether its source is a sarcastic site (such as
The Onion) or not.
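To make the raw form concrete, the following is a minimal sketch of how
a list of headline-label pairs could be split into an id-label table and
an id-position-word table. The helper ``split_headlines`` is
hypothetical (it is not part of Khiops), and splitting on whitespace is
an assumption about how the words were obtained.

.. code:: ipython3

   def split_headlines(records):
       """Split (headline, label) pairs into main rows and word rows.

       Hypothetical helper for illustration, not part of Khiops.
       - main rows: (headline_id, label)
       - word rows: (headline_id, position, word)
       """
       main_rows = []
       word_rows = []
       for headline_id, (headline, label) in enumerate(records):
           main_rows.append((headline_id, label))
           # Assumption: words are obtained by splitting on whitespace
           for position, word in enumerate(headline.split()):
               word_rows.append((headline_id, position, word))
       return main_rows, word_rows


   # One raw record, used here only as an example input
   raw_records = [
       ("thirtysomething scientists unveil doomsday clock of hair loss", "yes"),
   ]
   main_rows, word_rows = split_headlines(raw_records)
   print(main_rows)
   print(word_rows[:3])

Every word row repeats the id of its headline, which is what lets the
two tables be linked again later.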
We have transformed this dataset into two tables such that the
text-label record

::

   "groundbreaking study finds gratification can be deliberately postponed" yes

is transformed to an entry in a table that contains id-label records

::

   97 yes

and various entries in a secondary table linking a headline id to its
words and positions

::

   97 0 groundbreaking
   97 1 study
   97 2 finds
   97 3 gratification
   97 4 can
   97 5 be
   97 6 deliberately
   97 7 postponed

Thus the ``HeadlineSarcasm`` dataset has the following multi-table
schema

::

   +-----------+
   |Headline   |
   +-----------+ +-------------+
   |HeadlineId*| |HeadlineWords|
   |IsSarcasm  | +-------------+
   +-----------+ |HeadlineId*  |
        |        |Position     |
        +-1:n--->|Word         |
                 +-------------+

The ``HeadlineId`` variable is special because it is a *key* that links
a particular headline to its words (a 1:n relation).

*Note: There are other methods more appropriate for this text-mining
problem. This multi-table setup is only for pedagogical purposes.*

To train a classifier with Khiops in this multi-table setup, this schema
must be codified in the dictionary file. Let’s check the contents of the
``HeadlineSarcasm`` dictionary file:

.. code:: ipython3

   sarcasm_kdic = os.path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")

   print(f"HeadlineSarcasm dictionary file: {sarcasm_kdic}")
   print("")
   peek(sarcasm_kdic, n=15)

.. parsed-literal::

   HeadlineSarcasm dictionary file: data/HeadlineSarcasm/HeadlineSarcasm.kdic

   Root Dictionary Headline(HeadlineId)
   {
       Categorical HeadlineId;
       Categorical IsSarcasm;
       Table(Words) HeadlineWords;
   };

   Dictionary Words(HeadlineId)
   {
       Categorical HeadlineId;
       Numerical Position;
       Categorical Word;
   };

As in the single-table case, the ``.kdic`` file describes the schema for
both tables, but note the following differences:

- The dictionary for the table ``Headline`` is prefixed by the ``Root``
  keyword to indicate that it is the main one.
- For both tables, their dictionary names are followed by
  ``(HeadlineId)`` to indicate that ``HeadlineId`` is the key of these
  tables.
- The schema for the main table contains an extra special variable
  defined with the statement ``Table(Words) HeadlineWords``. This, in
  addition to sharing the same key variable, is necessary to indicate
  the ``1:n`` relationship between the main and secondary tables.

Now let’s store the locations of the main and secondary tables and peek
at their contents:

.. code:: ipython3

   sarcasm_headlines_file = os.path.join("data", "HeadlineSarcasm", "Headlines.txt")
   sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")

   print(f"HeadlineSarcasm main table file: {sarcasm_headlines_file}")
   print("")
   peek(sarcasm_headlines_file, n=3)

   print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
   print("")
   peek(sarcasm_words_file, n=15)

.. parsed-literal::

   HeadlineSarcasm main table file: data/HeadlineSarcasm/Headlines.txt

   HeadlineId IsSarcasm
   0 yes
   1 no

   HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt

   HeadlineId Position Word
   0 0 thirtysomething
   0 1 scientists
   0 2 unveil
   0 3 doomsday
   0 4 clock
   0 5 of
   0 6 hair
   0 7 loss
   1 0 dem
   1 1 rep.
   1 2 totally
   1 3 nails
   1 4 why
   1 5 congress

The call to ``train_predictor`` is very similar to the single-table
case, but there are some differences.

The first is that we must pass the path of the extra secondary data
table. This is done with the ``additional_data_tables`` parameter, which
is a Python dictionary containing key-value pairs for each table. More
precisely:

- keys describe *data paths* of secondary tables. In this case only
  ``Headline`HeadlineWords``
- values describe the *file paths* of secondary tables. In this case
  only the file path we stored in ``sarcasm_words_file``

*Note: To understand what data paths are, see the “Multi-Table Tasks”
section of the Khiops* ``core.api`` *documentation.*

Secondly, we specify how many features/aggregates Khiops will create
with its multi-table AutoML mode. For the ``HeadlineSarcasm`` dataset
Khiops can create features such as:

- *Number of different words in the headline*
- *Most common word in the headline before the third one*
- *Number of times the word ‘the’ appears*
- …

It will then evaluate, select and combine the created features to build
a classifier. We’ll ask it to create ``1000`` of these features (the
default is ``100``).

With these considerations, let’s set up some extra variables and train
the classifier:

.. code:: ipython3

   sarcasm_results_dir = os.path.join("exercises", "HeadlineSarcasm")

   sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
       sarcasm_kdic,
       dictionary_name="Headline",  # This must be the main/root dictionary
       data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
       target_variable="IsSarcasm",
       results_dir=sarcasm_results_dir,
       additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
       max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
       max_trees=0,  # by default Khiops constructs 10 decision tree variables
   )
   print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
   print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")

.. parsed-literal::

   HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
   HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic

We may now take a look at the results with the visualization tool:

.. code:: ipython3

   # To visualize uncomment the line below
   # kh.visualize_report(sarcasm_report)

*Note: In the multi-table case, the input tables must be sorted by their
key column in lexicographical order. To do this you may use the Khiops*
``sort_data_table`` *function or your favorite software. The examples of
this tutorial have their tables pre-sorted.*

Exercise time!
~~~~~~~~~~~~~~

Repeat the previous steps with the ``AccidentsSummary`` dataset. It
describes the characteristics of traffic accidents that occurred in
France in 2018. It has two tables with the following schema:

::

   +---------------+
   |Accidents      |
   +---------------+
   |AccidentId*    |
   |Gravity        |
   |Date           |
   |Hour           |            +---------------+
   |Light          |            |Vehicles       |
   |Department     |            +---------------+
   |Commune        |            |AccidentId*    |
   |InAgglomeration|            |VehicleId*     |
   |...            |            |Direction      |
   +---------------+            |Category       |
          |                     |PassengerNumber|
          +---1:n-------------->|...            |
                                +---------------+

So for each accident we have its characteristics (such as ``Gravity`` or
``Light`` conditions) and those of each involved vehicle (its
``Direction`` or ``PassengerNumber``). The main task for this dataset is
to predict the variable ``Gravity``, which has two possible values:
``Lethal`` and ``NonLethal``.

We first save the paths of the ``AccidentsSummary`` dictionary file and
data table files into variables:

.. code:: ipython3

   accidents_kdic = os.path.join(
       kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic"
   )
   accidents_data_file = os.path.join(
       kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
   )
   vehicles_data_file = os.path.join(
       kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt"
   )

Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which table is the ``Root`` in this case?

.. code:: ipython3

   print(f"Accidents dictionary file: {accidents_kdic}")
   print("")
   peek(accidents_kdic, n=40)

   print(f"Accidents (main) data table: {accidents_data_file}")
   print("")
   peek(accidents_data_file)

   print(f"Vehicles data table: {vehicles_data_file}")
   print("")
   peek(vehicles_data_file)

.. parsed-literal::

   Accidents dictionary file: /github/home/khiops_data/samples/AccidentsSummary/Accidents.kdic

   Root Dictionary Accident(AccidentId)
   {
       Categorical AccidentId;
       Categorical Gravity;
       Date Date;
       Time Hour;
       Categorical Light;
       Categorical Department;
       Categorical Commune;
       Categorical InAgglomeration;
       Categorical IntersectionType;
       Categorical Weather;
       Categorical CollisionType;
       Categorical PostalAddress;
       Table(Vehicle) Vehicles;
   };

   Dictionary Vehicle(AccidentId, VehicleId)
   {
       Categorical AccidentId;
       Categorical VehicleId;
       Categorical Direction;
       Categorical Category;
       Numerical PassengerNumber;
       Categorical FixedObstacle;
       Categorical MobileObstacle;
       Categorical ImpactPoint;
       Categorical Maneuver;
   };

   Accidents (main) data table: /github/home/khiops_data/samples/AccidentsSummary/Accidents.txt

   AccidentId Gravity Date Hour Light Department Commune InAgglomeration IntersectionType Weather CollisionType PostalAddress
   201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-BehindVehicles-Frontal route des Ansereuilles
   201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollision Place du général de Gaul
   201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision Rue nationale
   201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2Vehicles-Side 30 rue Jules Guesde
   201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Vehicles-Side 72 rue Victor Hugo
   201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection LightRain Other D39
   201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection Normal Other 4 route de camphin
   201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection Normal Other rue saint exupéry
   201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other rue de l'égalité

   Vehicles data table: /github/home/khiops_data/samples/AccidentsSummary/Vehicles.txt

   AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObstacle ImpactPoint Maneuver
   201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
   201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
   201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
   201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDirectionChange
   201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
   201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
   201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
   201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
   201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft

We now save the results directory for this exercise:

.. code:: ipython3

   accidents_results_dir = os.path.join("exercises", "AccidentSummary")
   print(f"AccidentsSummary exercise results directory: {accidents_results_dir}")

.. parsed-literal::

   AccidentsSummary exercise results directory: exercises/AccidentSummary

Train a classifier for the ``Accidents`` database with 1000 variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Save the resulting file locations into the variables
``accidents_report`` and ``accidents_model_kdic`` and print them.

Do not forget:

- The target variable is ``Gravity``
- The key for the ``additional_data_tables`` parameter is
  ``Accident`Vehicles`` and its value is ``vehicles_data_file``
- Set ``max_trees=0``

.. code:: ipython3

   accidents_report, accidents_model_kdic = kh.train_predictor(
       accidents_kdic,
       dictionary_name="Accident",
       data_table_path=accidents_data_file,
       target_variable="Gravity",
       results_dir=accidents_results_dir,
       additional_data_tables={"Accident`Vehicles": vehicles_data_file},
       max_constructed_variables=1000,
       max_trees=0,
   )
   print(f"AccidentsSummary report file: {accidents_report}")
   print(f"AccidentsSummary modeling dictionary: {accidents_model_kdic}")

.. parsed-literal::

   AccidentsSummary report file: exercises/AccidentSummary/AllReports.khj
   AccidentsSummary modeling dictionary: exercises/AccidentSummary/Modeling.kdic

Take a look at the report
^^^^^^^^^^^^^^^^^^^^^^^^^

Which variables are good predictors of the gravity of an accident?

.. code:: ipython3

   # To visualize uncomment the line below
   # kh.visualize_report(accidents_report)
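If the visualization tool is not at hand, the question above can also
be explored by reading the report file directly. The following is a
minimal sketch using only the standard library. It assumes (this is not
stated in this notebook) that the ``.khj`` report is a JSON document
with a ``preparationReport`` entry containing a ``variablesStatistics``
list whose items have ``name`` and ``level`` fields, and that a higher
``level`` means a more informative variable. The helper names are
hypothetical and the example values are invented.

.. code:: ipython3

   import json


   def load_report(report_path):
       """Load a JSON report file into a Python dictionary."""
       with open(report_path, encoding="utf8", errors="replace") as report_file:
           return json.load(report_file)


   def top_variables_by_level(report, n=5):
       """Return the n (name, level) pairs with the highest level.

       Assumption: the report has the layout
       report["preparationReport"]["variablesStatistics"] = [{"name", "level"}, ...]
       """
       stats = report.get("preparationReport", {}).get("variablesStatistics", [])
       ranked = sorted(stats, key=lambda var: var.get("level", 0.0), reverse=True)
       return [(var.get("name"), var.get("level", 0.0)) for var in ranked[:n]]


   # Synthetic stand-in for a loaded report (names and values are invented)
   example_report = {
       "preparationReport": {
           "variablesStatistics": [
               {"name": "VarA", "level": 0.01},
               {"name": "VarB", "level": 0.20},
               {"name": "VarC", "level": 0.05},
           ]
       }
   }
   print(top_variables_by_level(example_report, n=2))

   # With a real report one would call, for example:
   # top_variables_by_level(load_report(accidents_report))

If the assumed keys are absent, the function returns an empty list
rather than failing, so it is worth checking the top-level keys of the
loaded dictionary first.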